Imputation method for single-cell RNA-seq data using neural topic model

Abstract Single-cell RNA sequencing (scRNA-seq) technology studies the transcriptome and cell-to-cell differences at single-cell resolution and from different perspectives. Despite the advantage of high capture efficiency, downstream functional analysis of scRNA-seq data is made difficult by the excess of zero values (i.e., the dropout phenomenon). To effectively address this problem, we introduce scNTImpute, an imputation framework based on a neural topic model. A neural network encoder is used to extract underlying topic features of single-cell transcriptome data to infer high-quality cell similarity. At the same time, we determine which transcriptome data are affected by the dropout phenomenon according to a mixture model whose parameters are learned by the neural network. On the basis of stable cell similarity, information on the same gene in other similar cells is borrowed to impute only the missing expression values. Evaluation on real data shows that scNTImpute can accurately and efficiently identify dropout values and impute them. In addition, the clustering of cell subsets is improved, and the original biological information in cell clustering that was masked by technical noise is recovered. The source code for the scNTImpute module is available as open source at https://github.com/qiyueyang-7/scNTImpute.git.


Introduction
Bulk-cell RNA sequencing (RNA-seq) techniques have been widely used for transcriptome analysis to study transcriptional structure, splicing patterns, and expression levels of genes and transcriptomes [1]. To address biological issues such as cell heterogeneity and gene expression randomness, it is particularly important to interpret cell-specific transcriptome landscapes [2]. Although the bulk-cell RNA-seq technique is popular, it measures the average expression level of genes across a batch of cells, so the expression of variable genes is pulled toward the average. Therefore, it is not possible to study cell specificity based on bulk transcriptomics. Fortunately, by studying gene expression status in single cells, scRNA-seq technology overcomes the shortcomings of traditional batch cell sequencing technology and is becoming a powerful tool to capture the intercell variability of the transcriptome. It has dramatically changed the study of transcriptomics, helping us to decode life at a higher resolution and in its spatiotemporal structure, accurately reflecting the heterogeneity between cells. The study of scRNA-seq data has become a hot subject today.
Currently, multiple single-cell RNA-seq (scRNA-seq) platforms are in use, the two most popular being Fluidigm and Drop-Seq. Drop-Seq processes thousands of cells in a single run, which not only saves time and cost but is also simple to operate. Fluidigm, while it usually processes fewer cells, has higher coverage rates. An increasing number of studies are therefore using these techniques to discover new cell types [3, 4], new markers for specific cell types [3, 5, 6], and cell heterogeneity [6][7][8][9][10][11].
However, scRNA-seq technology has its corresponding drawbacks. scRNA-seq data have a relatively higher noise level than batch cell RNA-seq data, resulting in a major problem: the sparsity of the gene expression matrix (i.e., the data often exhibit a large number of zero values) [12]. Most of these zeros are artificially caused by defects in sequencing techniques, including, but not limited to, inadequate gene expression, low capture rates and sequencing depth, or other technical factors [13, 14]. As a result, the observed zero value does not reflect the underlying true expression level [15, 16]. This gene expression bias may be further increased during subsequent amplification steps. Thus, dropout events can significantly affect downstream bioinformatics analysis. At present, researchers have proposed a variety of imputation models through different principles and methods [17, 18]. These research results have a great guiding role in scRNA-seq data integration, enrichment analysis, and so on. According to the design characteristics of the imputation algorithm, as well as the data feature learning and processing methods, we roughly divide the RNA-seq data imputation methods into two categories: deep learning-based imputation methods and non-deep learning imputation methods [19].
Traditional non-deep learning imputation algorithms are conceptually simple: they usually fit a statistical probability model or use the expression matrix for smoothing and diffusion. They therefore have certain advantages in some specific types of samples. Wagner et al. [20] used the k-nearest neighbors (KNN) smoothing method, finding the k nearest neighbors between cells and aggregating gene-specific UMI (unique molecular identifier) counts to impute the gene expression matrix. In finding the number of nearest neighbors k, instead of fitting a certain model, the data's imputation is achieved stepwise by constructing a partially smoothed profile with a variance-stabilizing transformation. Li and Li [21] introduced a statistical method, scImpute, which uses a mixture model to learn the dropout probability of each gene in each cell. By setting a dropout probability threshold, the input data are divided into two parts: the set of genes severely affected by "dropout," $A_j$, and the set of unaffected genes, $B_j$. Eventually, the information on similar cells is learned from $B_j$ for imputation. scImpute automatically identifies possible dropout values and performs imputation only on these values without introducing new biases to the rest of the data. Huang et al. [22] proposed the SAVER algorithm, which uses information across genes and cells to impute zero values so as to optimize the expression of all genes. By looking for potential relationships between genes, the true expression level of each gene in each cell can be restored, eliminating technical differences. Nevertheless, SAVER alters all gene expression levels, including those not affected by dropout events, which could introduce new biases into the data and potentially eliminate biologically significant variation. For scRNA-seq data that are large, often high-dimensional, sparse, and complex, analysis using traditional computational methods becomes difficult and infeasible [23, 24].
Deep neural network algorithms have been widely applied in biomedical fields in recent years; they mine complex relationships within single-cell data through a series of basic hierarchical operations. The typical deep learning algorithms applied to scRNA-seq data are autoencoders (AEs), variational autoencoders (VAEs), generative adversarial networks (GANs), and other models. Eraslan et al. [25] proposed the deep count autoencoder (DCA) network model by improving the conventional autoencoder. The reconstruction error is defined as the probability of the noise model distribution rather than the reconstruction of the input data themselves. Gene-specific distribution parameters are learned by minimizing reconstruction errors in an unsupervised manner. The noise model is eventually applied to sparse count data, giving it a loss function specifically for scRNA-seq data. Meanwhile, its deep learning framework is capable of capturing the complexity and nonlinearity of scRNA-seq data and is highly scalable. Arisdakessian et al. [26] proposed a deep neural network-based imputation algorithm (DeepImpute) that constructs multiple subneural networks, which impute genes in a divide-and-conquer manner, not only achieving the highest overall accuracy but also providing faster computing time and requiring less memory. Xu et al. [27] proposed a scRNA-seq data imputation method (scIGANs) founded on a generative adversarial network. The method uses generated cells rather than cells observed in the original matrix to balance the performance between dominant and rare cell populations, enabling it to learn nonlinear gene-to-gene dependencies from complex samples of multicellular types and train generative models to generate realistic expression profiles of defined cell types. After training, KNN is used to impute the same type of cells, thereby eliminating technical variations without damaging intercell biological variability. This method is robust to small data with low expression or intercell differences [27][28][29].
Most downstream analyses of scRNA-seq, such as differential gene expression analysis, cell-type specific gene identification, and new cell-type definition, rely on the accuracy of gene expression measurements. Therefore, it is particularly important to correct the expression of "false zero values" caused by dropout events in scRNA-seq data through accurate and robust imputation methods [21]. These imputation methods identify the dropout values in scRNA-seq data from different perspectives and impute them. However, non-deep learning methods cannot effectively learn the feature relationships of some complex nonlinear data and lack flexibility and extensibility. The architecture of deep learning itself is a "black box," with many learning layers and thousands of nodes, which makes the learned underlying features, and the rich potential they unlock in the single-cell dataset, uninterpretable [30].
Although the study of RNA-seq data is an active area of research, accurate recovery of single-cell gene expression data remains a great challenge. Inspired by the neural topic model, we have designed an accurate and stable imputation method, called scNTImpute, that can more precisely impute gene expression affected by dropout. Specifically, scNTImpute performs deep feature extraction and constructs encoder networks through the coding learning mechanism of transferable neural networks, learning network parameters and highly interpretable mixtures of cell topics from scRNA-seq data. Topic features can be used to learn the similarity of cells, and researchers can perform topic pathway enrichment analysis on them at a later stage. This is used to explore whether they have relevance to currently known gene pathways, as well as to uncover topics that may be condition specific or cell type specific, to improve the interpretability of the deep features from a biological perspective. Concurrently, we obtain underlying connections such as cell to cell, cell to gene, or gene to gene in single-cell data. The flexibility of the neural topic model makes it excellent for processing scRNA-seq data. Besides, scNTImpute uses neural networks to learn the mixture model parameters of the gene expression distribution, solving for the dropout probability of each gene in each cell. This allows us to more directly understand the true state of the expression data of the scRNA-seq transcriptome and distinguish which gene transcripts are affected by dropout. Using information about the same gene in other similar cells allows us to impute the dropout value in a cell through underlying cell-gene connections. For this to work, it is important that the borrowed information is selected from genes that are as free as possible from dropout events.

scNTImpute model overview
We propose a new scRNA-seq data imputation method based on a neural topic model adapted from the single-cell embedded topic model (scETM), which inherits the advantages of topic modeling and is very effective in dealing with heavy-tailed and large distributions of word frequencies [31, 32]. For the analysis of scRNA-seq data, we pass the sampled cell and transcriptome expressions separately as vectors of normalized counts to two fully connected neural networks (i.e., two-layer fully connected encoders). First, using a fully connected neural network encoder, we infer the topic mixing ratio of cells, namely, the cell-topic mixture (Fig. 1A). Second, we use the second neural network to infer the mixed distribution parameters of the transcriptome and obtain probability estimates of whether the gene expression value in each cell is a dropout value by using the mixed distribution model. Finally, the cell-topic mixture is used to infer cells similar to the cell in which the dropout gene is located, and the same gene's information from those similar cells is used for the imputation of dropout values (Fig. 1B).
We performed imputation experiments on another published real dataset, Chung et al. [34]. The above imputation indexes are not the only criteria for evaluating the imputation of RNA-seq data. Different from the above, two other imputation indexes were used for evaluation: cosine similarity (CS) and the Fowlkes-Mallows score (FMS). Similarly, we compared scNTImpute with several other existing imputation models. The evaluation results were visualized (Fig. 4), from which we could see that our model performed the best in both CS and FMS. Especially in FMS, there was a large gap between our model and the other imputation methods. (The specific imputation comparison results are shown in Table 2, and Fig. 5 shows the clustering effect on the complete Chung et al. [34] data set.)
To more effectively validate the robustness and stability of scNTImpute, as well as highlight the strengths of our model, we applied scNTImpute to more diverse real scRNA-seq datasets. In addition to the existing comparison methods mentioned above, we included several more advanced imputation methods for comparison (i.e., AE-TPGG [44], scGNN [45], scISR [46], scScope [47]). Besides, since the cell-cell distance matrix in MAGIC is based on Euclidean distances, we added the MAGIC-C method, which operates on count data, to understand how the form of the data affects the imputation. For the convenience of comparison and differentiation, the original MAGIC method based on normalized data is referred to as MAGIC-N [44]. First, we applied the model to a temporal scRNA-seq dataset involving mouse preimplantation embryonic development data (Deng et al. [48]). The Deng et al. [48] dataset includes single cells from 10 early mouse developmental stages, including zygotes; 2-, 4-, 8-, and 16-cell stages; and blastocysts [39]. We compared scNTImpute with a new imputation method called AE-TPGG on this dataset, using the aforementioned evaluation metrics (ARI, NMI). We visualized the imputation results (Fig. 6), from which we could observe that scNTImpute achieved the highest scores in both metrics, followed by our newly added AE-TPGG imputation method (Fig. 7 shows the clustering [...] [50]), we compared three new imputation methods. scScope is a scalable deep learning-based approach. scGNN employs a graph neural network, which provides a hypothesis-free deep learning framework for scRNA-seq analysis. In contrast, scISR is a single-cell imputation method that utilizes subspace regression. We applied these three new methods alongside our model to these two real datasets. The evaluation was performed using the same metric, ARI. To visualize the results more intuitively, we plotted the imputation results of these methods (Figs. 8 and 9 represent the imputation results for the HP dataset [49] and the Romanov et al. [50] dataset, respectively). From the figures, it is evident that scNTImpute achieved ARI scores close to 0.7 on both datasets, outperforming the other imputation methods (achieving the highest ARI). This further confirms scNTImpute's effectiveness in recovering true biological information from sparse single-cell data.

scNTImpute improves the clustering of cell subpopulations
To test the ability of scNTImpute to improve cell type or cell subgroup clustering, we applied scNTImpute to real scRNA-seq data (i.e., again the Chung et al. [34] dataset). In addition to reusing the above ARI and NMI evaluation indexes, we also adopted another commonly used clustering index, adjusted mutual information (AMI). We imputed scRNA-seq data with scNTImpute and other imputation models and compared cell clustering with the complete imputation data. In the comparison of evaluation data (refer to Table 4 for specific data and Fig. 10 for visualization of comparative data), our imputation method was the highest in the ARI index and had relatively significant and stable performance in the AMI and NMI clustering indexes (Fig. 11 shows the clustering effect after imputation).
Moreover, we evaluated the clustering effect of scRNA-seq data after imputation on another real dataset, Hrvatin [3]. We used the cell types stated in the original publication as the ground truth and ARI as a performance indicator. Unlike the previous comparison, here we combined scNTImpute and other developed imputation models with clustering methods for evaluation; that is, before using the clustering algorithm, we used other imputation models to process the data and compared the results with our model. Several clustering algorithms, such as pcaReduce [51], SC3 [52], and t-SNE [53] followed by k-means (t-SNE/kms), were used to cluster scRNA-seq data. These methods did not explicitly address the dropout events in scRNA-seq data. Therefore, in model comparison, there were two assumptions: (i) preprocessing of RNA-seq data with dropout events by other imputation algorithms, which would improve the accuracy of these clustering methods, and (ii) comparison between scNTImpute and existing imputation algorithms, such as DrImpute [39], CIDR [54], scImpute [21], and MAGIC [37]. scNTImpute performs better in handling dropout events to improve clustering performance (Fig. 12 shows the visualization of evaluation data; see Table 5 for specific data). We can clearly see the experimental comparison of 5 imputation methods and individual imputation methods combined with clustering algorithms. We found that the effect of scNTImpute was significantly better than the clustering enhancement performance of CIDR, followed by SC3 + DrImpute (Fig. 13).

Transfer learning across single-cell datasets
A prominent feature of scNTImpute is its parameters, so the knowledge of modeling scRNA-seq data can be transferred across [...] [32] were used to conduct cross-species transfer learning of scNTImpute models. Both datasets were obtained using the inDrop method (a droplet-based single-cell RNA-seq sequencing technique). The assumption of transfer learning is that the distribution of data in the source domain should be similar to the distribution of data in the target domain. Therefore, we primarily demonstrate the similarity of the datasets used in the transfer learning section from two perspectives. First, mean, variance, and standard deviation are commonly used statistical measures to analyze the characteristics and similarities of datasets. If their means are close and their variances and standard deviations are similar, then their similarity is higher. Conversely, if these measures differ significantly, their similarity is lower. The results of the evaluation metrics for the two datasets are shown in Table 6. We also visualize the data from Table 6 (Fig. 14), which provides a more intuitive way of observation. We find that their values for all three statistical indicators are very close. Especially in terms of standard deviation, the two datasets have almost the same level of dispersion. Additionally, by comparing the means, we can see that the central tendencies of the two datasets are also very similar. Second, the probability distribution function describes the probability of the possible values of a random variable. It can also help us understand and analyze similarities between datasets. For discrete data such as scRNA-seq datasets, we can calculate the frequency of occurrence for each value and divide these frequencies by the total size of the dataset to obtain the probability of each value. To facilitate the intuitive analysis of the probability distribution functions of the two datasets, we visualized the two probability distributions (Fig. 15). Due to the sparsity of the original scRNA-seq datasets, the frequencies of zero values are relatively high in both datasets. By visualizing the probability distributions, we can easily observe that the distributions of the two datasets are very similar.
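The two similarity checks described in this paragraph can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the toy data are hypothetical, the count matrices are assumed to be flattened into plain lists, and population (rather than sample) variance is an arbitrary choice, since the text does not say which was used.

```python
from collections import Counter
from statistics import fmean, pstdev, pvariance


def summary_stats(values):
    """Return mean, population variance, and population standard deviation."""
    return {"mean": fmean(values), "variance": pvariance(values), "std": pstdev(values)}


def empirical_distribution(values):
    """Map each distinct value to its relative frequency in the dataset."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}


# Hypothetical flattened count matrices standing in for the two datasets.
dataset_a = [0, 0, 0, 1, 2, 0, 3, 0]
dataset_b = [0, 0, 1, 0, 2, 0, 0, 3]

stats_a = summary_stats(dataset_a)
stats_b = summary_stats(dataset_b)
dist_a = empirical_distribution(dataset_a)
dist_b = empirical_distribution(dataset_b)
```

Comparing `stats_a` with `stats_b` corresponds to the first perspective, and comparing `dist_a` with `dist_b` (for example, the share of zeros) corresponds to the second.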
Next, we conducted research on transfer learning. First, when the model was trained directly on the HP dataset, the 4 imputation indicators were ARI, 0.681; NMI, 0.751; RI, 0.884; and MI, 1.429. Second, we trained a scNTImpute model on the MP dataset and used the trained model to impute and evaluate HP data. Ultimately, an exciting transfer learning effect was produced (ARI reached 0.858 in the HP dataset; refer to Table 7 for other specific results). In order to verify the stability of the model transfer learning, we conducted the transfer learning from the HP dataset to the MP dataset, and the results were also surprisingly good. The results of direct imputation and transfer learning imputation of the HP dataset were visualized by UMAP (Fig. 16). After transfer, scNTImpute improved many indicators and learned to be cell type-specific (Table 7, Fig. 16). To compare with other methods, we used scNTImpute, scVI-LD, and scVI to evaluate clustering performance in transfer learning tasks. Clustering performance was mainly measured by the ARI between real cell types and Leiden [43] clusters. Overall, scNTImpute obtained the best learning results in cross-species transfer learning between HP and MP (Table 8).

Path enrichment analysis and statistical significance test of scNTImpute topics
We next investigated separately whether the topics of scNTImpute were biologically relevant to known human genetic pathways and whether there were differences between topics. First, we employed gene set enrichment analysis [55] for exploration. We trained a scNTImpute model with 50 topics using the HP dataset. For the obtained topics, we detected a number of significantly enriched pathways. Several of these were related to pancreatic function, including insulin receptor recycling, the cardiac myocyte insulin receptor signaling pathway, insulin signaling pathways, pancreatic cancer, and so on (Fig. 17, Fig. 18). The set of genes contained between the black bar in the middle and the highest point (ES value) is called the leading-edge subset, which contributed to the upregulation of the entire pathway. Furthermore, based on the differences in the enrichment levels of topics in the pathway, there were also significant differences between topics. We calculated the P value and fold change (FC); the P value was converted to its negative logarithm, $-\log_{10}(P)$, while the fold change was logarithmically converted to $\log(\mathrm{FC})$. (The red line in Fig. 19 is the threshold line for $P < 0.01$.) In general, after taking the negative logarithm of the P values, most of our topics have P values smaller than the set threshold, which shows that there are significant differences between our topics (Fig. 19).
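The two transformations used for the significance plot can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, and the base-2 logarithm for the fold change is an assumption, because the text only says the fold change was "logarithmically converted".

```python
import math


def volcano_coordinates(p_value, fold_change):
    """Return (log2 fold change, -log10 P value) for one topic.

    Base 2 for the fold change is an assumption made for this sketch.
    """
    return math.log2(fold_change), -math.log10(p_value)


def is_significant(p_value, threshold=0.01):
    """A topic counts as significant when its P value is below the threshold."""
    return p_value < threshold


# Height of the threshold line on the -log10(P) axis for P < 0.01.
threshold_line = -math.log10(0.01)

# Hypothetical topic with P = 0.001 and a fourfold change.
x, y = volcano_coordinates(p_value=0.001, fold_change=4.0)
```

On the transformed axis, a P value below 0.01 corresponds to a point above the threshold line, which is why a smaller P value appears higher in such a plot.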

Scalability and efficiency
To validate the scalability and efficiency of the proposed scNTImpute for transferable learning, we tested it on datasets with different numbers of genes and recorded the runtime. Specifically, we took the trained model directly for imputation on additional datasets. We performed imputation on datasets containing 1k, 2k, 5k, 10k, and 15k genes and investigated the relationship between running time and the number of genes (Fig. 20). The running time of imputation exhibited a linear increase relative to the number of genes. As scNTImpute is a neural topic-based imputation method, its runtime increases with the number of genes. In practical applications, the number of genes and cells is limited. So, scNTImpute is more suitable for scRNA-seq datasets than other imputation methods.

Workflow
We adopted an imputation workflow based on a neural topic model, implemented using the PyTorch (RRID:SCR_018536) dynamic framework on the back end. Our work is divided into the following steps.

Data preprocessing
Figure 7: The clustering of the Deng et al. [48] dataset after imputation using scNTImpute.

We took as input the scRNA-seq gene count expression matrix X, where the rows represented cells and the columns represented genes. Data filtering and quality control were performed as a previous step of data preprocessing, and we eventually wanted to get an imputation matrix with the same dimensionality as the original count matrix. To facilitate the subsequent work, we first normalized each sample (cell) and each gene in the matrix separately. To avoid infinite values of the parameters in later model training, we added pseudo-counts to the matrix before the logarithmic transformation. The advantage of logarithmic transformation is that it can prevent some large observations from having a significant impact, eliminate heteroscedasticity issues, and transform the values into continuous ones, providing greater flexibility for modeling.
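As an illustration of why the pseudo-count matters, the sketch below scales each cell to a common total and then takes a logarithm. It covers only the per-cell step; the scale factor of 10,000, the pseudo-count of 1, the base-10 logarithm, and the function name are all assumptions made for this example and are not taken from the text.

```python
import math


def log_normalize(counts, scale=1e4, pseudo_count=1.0):
    """Scale each cell (row) to a common total, then apply log10(x + pseudo_count)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no reads are left at zero rather than divided by zero.
        factor = scale / total if total > 0 else 0.0
        normalized.append([math.log10(value * factor + pseudo_count) for value in cell])
    return normalized


# Rows are cells and columns are genes, as in the count matrix X.
X = [[0, 5, 5], [2, 0, 0], [0, 0, 0]]
Y = log_normalize(X)
```

With a pseudo-count of 1, a zero count maps to exactly 0 instead of negative infinity, and the output keeps the same dimensions as the input.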

Topic generation process
Inspired by scETM's research on single-cell transcriptomics [31], we adopted a neural topic model to model the scRNA-seq data distribution [57]. We treated each cell as a document, and each scRNA-seq read (or UMI) served as a token in the document. The gene that generates the read (or UMI) is thought of as a word in a vocabulary. We assumed that each cell could be represented as a mixture of underlying cell types, which are often referred to as latent topics. In the original LDA model [57], a fixed set of N independent Dirichlet distributions β was defined, distributed over a vocabulary of size M. Formally, the cell-topic mixture generation process is as follows.
The latent topic proportion of cell $c$ is drawn from a logistic normal distribution: $\delta_c \sim \mathcal{N}(0, I)$, $\theta_c = \mathrm{softmax}(\delta_c)$. To simulate the sparsity of gene expression in each cell, the softmax function was used to regulate the expression of all genes. For this purpose, we obtained a mixture of all the cell topics θ [31].
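Drawing a topic mixture from a logistic normal distribution amounts to sampling a Gaussian vector and passing it through a softmax. The sketch below shows this with the standard library only; the function names, the number of topics, and the seed are hypothetical choices for illustration.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_mixture(num_topics, rng):
    """Draw delta ~ N(0, I) and map it onto the probability simplex with softmax."""
    delta = [rng.gauss(0.0, 1.0) for _ in range(num_topics)]
    return softmax(delta)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
theta = sample_topic_mixture(num_topics=5, rng=rng)
```

The resulting `theta` has strictly positive entries that sum to 1, so it can be read as the proportion of each topic in one cell.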
To enable the extraction of latent topics from the data, in scNTImpute, we consider the latent cell topic mixture $\theta_c$ as the unique latent variable for each cell $c$. We denoted the posterior distribution of latent variables as $p(\delta_c \mid y_c)$, where $y_c$ is the normalized gene expression of cell $c$ described above. However, directly solving for the true posterior distribution in high-dimensional space is computationally challenging. Therefore, a variational inference method was employed to approximate the true posterior distribution by minimizing the difference between $q(\delta_c)$ and $p(\delta_c \mid y_c)$; the variational posterior is easier to compute than the true posterior. Specifically, we defined the following distribution: $q(\delta_c \mid y_c) = \mathcal{N}(\mu_c, \operatorname{diag}(\sigma_c))$, with $(\mu_c, \sigma_c) = \mathrm{NNET}(y_c; W_\theta)$. This network approximated the sufficient statistics of the cellular topic mixture $\delta_c$. To learn the variational parameter $W_\theta$ mentioned above, we optimized the evidence lower bound (ELBO) of the log-likelihood [31]. This involved minimizing the Kullback-Leibler (KL) divergence, which measures the difference between the approximate posterior distribution and the true posterior distribution. The goal is to reduce this divergence and bring the approximate posterior closer to the true posterior: $\mathrm{ELBO} = \mathbb{E}_{q(\delta_c \mid y_c)}[\log p(y_c \mid \delta_c)] - \mathrm{KL}(q(\delta_c \mid y_c) \,\|\, p(\delta_c))$. The first term represents the reconstruction likelihood, which is measured using the negative log-likelihood. The second term is a regularization term involving the KL divergence between the approximate distribution $q(\delta_c \mid y_c) = \mathcal{N}(\mu_c, \operatorname{diag}(\sigma_c))$ and the prior distribution $p(\delta_c) = \mathcal{N}(0, I)$, which encourages the variational posterior distribution to approximate the prior distribution. We drew a few latent variable samples from the reparameterized Gaussian distribution $q(\delta_c \mid y_c)$, where the mean and variance were determined by the aforementioned NNET (a two-layer feedforward neural network for estimating the sufficient statistics of the proposed distribution of cell topic mixtures). These samples served as noisy estimates for the ELBO [31]. Ultimately, the gradients were backpropagated to optimize the weights of the encoder to achieve the goal of extracting topic features.
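The KL regularization term and the reparameterized sampling step described above have simple closed forms for a diagonal Gaussian against a standard normal prior. The following sketch illustrates them without any deep learning framework. The function names are hypothetical, `sigma` is assumed to hold standard deviations, and the reconstruction term is passed in as a plain number rather than computed from a decoder.

```python
import math
import random


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ); sigma holds standard deviations."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


def reparameterized_sample(mu, sigma, rng):
    """delta = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way is what lets an autodiff framework
    differentiate through it with respect to mu and sigma.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def negative_elbo(reconstruction_nll, mu, sigma):
    """Loss to minimize: reconstruction negative log-likelihood plus the KL term."""
    return reconstruction_nll + kl_to_standard_normal(mu, sigma)


rng = random.Random(0)
mu, sigma = [0.5, -0.2], [1.0, 0.8]  # hypothetical encoder outputs for one cell
delta = reparameterized_sample(mu, sigma, rng)
loss = negative_elbo(reconstruction_nll=3.0, mu=mu, sigma=sigma)
```

The KL term is zero only when the approximate posterior equals the prior, so minimizing the loss trades reconstruction quality against staying close to $\mathcal{N}(0, I)$.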

Study dropout values
After acquiring the transformed gene expression matrix Y, we can infer which genes in the cell were affected by the dropout event.
Instead of considering all zero values as dropout values, we used a neural network to systematically determine whether zero values were dropout values. First, the normal distribution describes continuous data, while the read count (gene expression) data are discrete. Second, the read count data can only take nonnegative integer values, so for raw scRNA-seq data, the most commonly used normal distribution is not reasonable. Indeed, the zero-inflated negative binomial distribution has proven to be a good model for describing scRNA-seq data and serves as the basis for some outstanding models. With the presence of dropout events, most genes have bimodal expression patterns in similar cells. We adapted the mixture model used in scImpute [21]. Similar mixture models have been shown to effectively capture the bimodal features of single-cell gene expression data [32, 56, 45], where the Gamma distribution represents the dropout phenomenon and the Normal distribution represents actual gene expression. It is important to note that the transformed gene expression levels are no longer integers, so the negative binomial distribution widely used for read counts is not an appropriate choice. For each gene, the proportions and parameters of the two components may differ between cell types. As a result, we assume that the expression level of each gene $j$ is a random variable $Y_j$ following a Gamma-Normal mixture distribution, with a density function of [21]: $f_{Y_j}(y) = \lambda_j \, \mathrm{Gamma}(y; \alpha_j, \beta_j) + (1 - \lambda_j) \, \mathrm{Normal}(y; \mu_j, \sigma_j)$ (1), where $\lambda_j$ is the dropout rate of gene $j$, $\alpha_j$ and $\beta_j$ are the shape and rate parameters of the Gamma distribution, and $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the Normal distribution, respectively. When a sequencing experiment fails to accurately capture the transcriptional expression of genes, the Gamma distribution models the observed gene expression, while the Normal distribution models the actual gene expression level. The intuition behind this mixture model is that if a gene has high expression and low variation in multiple cells, the "zero" count expression is more likely to be affected by dropout events; on the other hand, if a gene has consistently low or moderate expression and high variation, then the zero counts may reflect real biological variability.
Given the mixture distribution, the log-likelihood of each gene across the expression levels in all $n$ cells can be calculated as $l(\lambda_j, \alpha_j, \beta_j, \mu_j, \sigma_j) = \sum_{i=1}^{n} \log f_{Y_j}(y_{ij}; \lambda_j, \alpha_j, \beta_j, \mu_j, \sigma_j)$ [56].
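A Gamma-Normal mixture density and its per-gene log-likelihood can be written directly from the definitions of the two component densities. The sketch below is illustrative only: the function names and the parameter values are hypothetical, the parameters are passed in by hand rather than produced by a neural network, and transformed expression values are assumed to be strictly positive.

```python
import math


def gamma_pdf(y, shape, rate):
    """Gamma density (shape/rate parameterization) on its support y > 0."""
    if y <= 0:
        return 0.0
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1) * math.log(y) - rate * y)
    return math.exp(log_pdf)


def normal_pdf(y, mean, std):
    """Normal density with the given mean and standard deviation."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def mixture_pdf(y, lam, shape, rate, mean, std):
    """lam * Gamma(y; shape, rate) + (1 - lam) * Normal(y; mean, std)."""
    return lam * gamma_pdf(y, shape, rate) + (1 - lam) * normal_pdf(y, mean, std)


def log_likelihood(values, lam, shape, rate, mean, std):
    """Sum of log mixture densities over all cells for one gene."""
    return sum(math.log(mixture_pdf(y, lam, shape, rate, mean, std)) for y in values)


# Hypothetical transformed expression of one gene across five cells.
expression = [0.004, 0.01, 1.8, 2.1, 2.0]
ll = log_likelihood(expression, lam=0.4, shape=0.5, rate=10.0, mean=2.0, std=0.3)
```

A Gamma component with shape below 1 concentrates its mass near zero, which is what lets it absorb the dropout peak while the Normal component describes the expressed mode.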
Figure 13: The clustering effect after imputing Hrvatin using scNTImpute.

The parameters in the model shown in Equation (1) are calculated by a neural network, and these estimates are denoted as $\hat{\lambda}_j, \hat{\alpha}_j, \hat{\beta}_j, \hat{\mu}_j, \hat{\sigma}_j$. We can filter the gene expression values based on the undetected probability of the gene in the cell [56], and the dropout rate of gene $j$ in cell $i$ can be computed as: $d_{ij} = \frac{\hat{\lambda}_j \, \mathrm{Gamma}(Y_{ij}; \hat{\alpha}_j, \hat{\beta}_j)}{\hat{\lambda}_j \, \mathrm{Gamma}(Y_{ij}; \hat{\alpha}_j, \hat{\beta}_j) + (1 - \hat{\lambda}_j) \, \mathrm{Normal}(Y_{ij}; \hat{\mu}_j, \hat{\sigma}_j)}$. Because $d_{ij} \in (0, 1)$, a smaller $d_{ij}$ indicates that the observed gene expression $Y_{ij}$ has higher confidence. We set a threshold $t$: when the dropout rate $d_{ij} < t$, $Y_{ij}$ is considered an accurate measurement with high confidence, and when $d_{ij} \geq t$, the gene expression $Y_{ij}$ is considered a dropout value.
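The dropout rate of one entry is the posterior weight of the Gamma component under the fitted mixture, followed by a threshold test. The sketch below illustrates that computation. The function names and parameter values are hypothetical, and the threshold of 0.5 is an assumption for the example, since the text does not state the value of $t$.

```python
import math


def gamma_pdf(y, shape, rate):
    """Gamma density (shape/rate parameterization) on its support y > 0."""
    if y <= 0:
        return 0.0
    return math.exp(shape * math.log(rate) - math.lgamma(shape)
                    + (shape - 1) * math.log(y) - rate * y)


def normal_pdf(y, mean, std):
    """Normal density with the given mean and standard deviation."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def dropout_probability(y, lam, shape, rate, mean, std):
    """Posterior probability that y came from the Gamma (dropout) component."""
    dropout_part = lam * gamma_pdf(y, shape, rate)
    expressed_part = (1 - lam) * normal_pdf(y, mean, std)
    return dropout_part / (dropout_part + expressed_part)


def is_dropout(y, params, threshold=0.5):
    """Flag the entry as a dropout value when its dropout rate reaches the threshold."""
    return dropout_probability(y, **params) >= threshold


# Hypothetical fitted parameters for one gene.
params = {"lam": 0.4, "shape": 0.5, "rate": 10.0, "mean": 2.0, "std": 0.3}
low = dropout_probability(0.004, **params)  # value near zero
high = dropout_probability(2.0, **params)   # value near the expressed mode
```

With these parameters, a value close to zero is attributed almost entirely to the dropout component, while a value near the Normal mean is treated as a confident measurement.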

Imputation
To impute the dropout values accurately, we needed to borrow the expression of gene $j$ from other similar cells that were not affected by dropout. Specifically, we calculated the cell similarity matrix on the basis of the cell-topic mixture $\theta$ obtained above, which is in essence a dimensionality reduction of the scRNA-seq data that also effectively reduces the impact of most dropouts. Each element of the matrix measures how similar one cell is to another. The degree of similarity between cell $i$ and every other cell $i'$ is computed from their topic mixtures, where $Z$ indicates the degree of similarity between cell $i$ and cell $i'$. A larger value indicates that two cells are less similar and less likely to belong to the same cell type, and a smaller value indicates greater similarity (excluding the degree of similarity of a cell with itself, i.e., the 0 value). Based on the similarity between cells, we can borrow the nondropout expression of the same gene from the similar cell $i'$ of cell $i$ to impute the dropout genes in cell $i$:

$$ X_{ij} = \begin{cases} X_{ij}, & d_{ij} < t \\ X_{i'j}, & d_{ij} \geq t \end{cases} $$
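A minimal sketch of this borrowing step is given below. Several details are assumptions made only for illustration: the Euclidean distance between topic mixtures stands in for the similarity measure, the value is borrowed from the single most similar cell in which the gene is not flagged as a dropout, and the names `impute`, `X`, `D`, and `theta` as well as the default threshold are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two topic-mixture vectors (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def impute(X, D, theta, t=0.5):
    """Fill dropout entries of X (cells x genes) from the most similar cell.

    X: expression values, D: dropout rates d_ij, theta: cell-topic mixtures.
    Entries with D[i][j] < t are kept; the others are replaced by the value of
    the same gene in the nearest cell whose entry is not a dropout.
    """
    n_cells = len(X)
    # Distance matrix: smaller values mean more similar cells.
    Z = [[euclidean(theta[i], theta[k]) for k in range(n_cells)]
         for i in range(n_cells)]
    X_imputed = [row[:] for row in X]
    for i in range(n_cells):
        # Other cells ordered from most to least similar to cell i.
        neighbors = sorted((k for k in range(n_cells) if k != i),
                           key=lambda k: Z[i][k])
        for j in range(len(X[i])):
            if D[i][j] < t:
                continue  # confident measurement, keep as is
            for k in neighbors:
                if D[k][j] < t:
                    X_imputed[i][j] = X[k][j]
                    break
    return X_imputed
```

If no other cell has a confident value for the gene, the entry is left unchanged in this sketch.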

Imputation evaluation
To benchmark the imputation performance, we compared scNTImpute with several scRNA-seq data imputation tools of the same kind. We used the original dataset for the evaluation experiments. After the imputation of the original data was completed, 4 leading evaluation indicators, the adjusted Rand index (ARI), Rand index (RI), normalized mutual information (NMI), and mutual information (MI), were used for the evaluation. ARI is defined as

$$ \mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]} $$

RI is defined as

$$ \mathrm{RI} = \frac{a + b}{C_n^2} $$

where $a$ is the number of cell pairs that are of the same type and are also assigned to the same cluster, $b$ is the number of cell pairs that are of different types and are assigned to different clusters, and $C_n^2$ is the total number of possible pairs. $E[\mathrm{RI}]$ is the expected RI under random labeling [ 45 ].
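For concreteness, RI and ARI can be computed from the true cell types and the cluster assignments as sketched below. The function names are hypothetical, and the ARI uses the standard contingency-table formulation of the chance correction, which is assumed here to match the expectation $E[\mathrm{RI}]$ referred to in the text.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(true_labels, pred_labels):
    """RI = (a + b) / C(n, 2), counting agreeing pairs of cells."""
    pairs = list(combinations(range(len(true_labels)), 2))
    # a: same true type and same cluster; b: different type and different cluster.
    a = sum(1 for i, k in pairs
            if true_labels[i] == true_labels[k] and pred_labels[i] == pred_labels[k])
    b = sum(1 for i, k in pairs
            if true_labels[i] != true_labels[k] and pred_labels[i] != pred_labels[k])
    return (a + b) / len(pairs)


def adjusted_rand_index(true_labels, pred_labels):
    """Chance-corrected Rand index computed from the contingency table."""
    n = len(true_labels)
    contingency = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in its own group
    return (sum_cells - expected) / (max_index - expected)
```

Both functions are invariant to how the clusters are named, so a clustering that matches the true types up to a relabeling scores 1.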
NMI is explained as

Figure 1: Workflow of scNTImpute. (A) scNTImpute uses a neural topic network architecture to model the single-cell transcriptome. Normalized counts of the gene expression data matrix and its transpose matrix for each single-cell dataset are used as input to the encoder. The encoder network generates random samples of potential cell-topic mixtures ($\theta_d$, cells $d = 1, \dots, N$) that can be used to compute intercell similarity. Neural networks learn the parameters of a mixture model of gene expression data, which can be used to identify dropout values. (B) Imputation using similar-cell information. A cell similarity matrix is generated by calculating the intercellular similarity from the resulting cell-topic mixture. Based on the learned parameters of the mixture model, the dropout value is identified and imputed with the information from cells (cell j) similar to the cell where the dropout value is located (cell d). (C) Transfer learning workflow. The scNTImpute model trained on the reference scRNA-seq dataset can infer the cell-topic mixture $\theta$ and the mixture model distributions from the unseen scRNA-seq dataset and perform accurate imputation on the unseen dataset. The scRNA-seq dataset is visualized by UMAP and evaluated by standard unsupervised clustering metrics using real cell types.

Figure 2: Visualization of human brain scRNA-seq data imputation metrics results.

Figure 3: Clustering of human brain scRNA-seq data after imputation by scNTImpute.

Figure 4: Comparison of imputation results of different methods for Chung et al. [ 34 ] data.

Figure 6: Comparison of results after imputation of Deng et al. [ 48 ] data by various methods.

Figure 8: Results of different imputation methods on the human islet dataset [ 49 ].

Figure 9: Results of different imputation methods in the Romanov et al. [ 50 ] dataset.

Figure 10: Effectiveness of different imputation methods in clustering Chung et al. [ 34 ] data.

Figure 14: Comparison of similarity indexes (mean, variance, standard deviation) between mouse islet data and human islet data.

Figure 15: Probability distribution functions for mouse islet data and human islet data.

Figure 16: The transfer learning performance of scNTImpute: transfer learning from mouse pancreatic islet data to human pancreatic islet data (MP-HP) and from human pancreatic islet data to mouse pancreatic islet data (HP-MP).

Table 8:

Figure 17: The topic of human islet data is enriched in the insulin receptor recycling pathway.

Figure 18: The topic of human islet data is enriched in the cardiac myocyte insulin receptor signaling pathway.

Figure 19: Differences in the topic matter of human islet data.

Figure 20: Time complexity analysis of scNTImpute transfer learning.

Table 1: Multiple metrics used to measure the imputing output of multiple imputation methods on real scRNA-seq data from the human brain

Table 2: …ng output of multiple imputation methods on real Chung et al. [ 34 ] scRNA-seq data using different metrics

Table 3: …ement of the imputation results of multiple methods on real Deng et al. [ 48 ] scRNA-seq data.

Table 4: Clustering performance comparison of multiple methods in Chung et al. [ 34 ] data

Table 5: Clustering output of multiple imputation methods on real Hrvatin data

Table 6: Three metrics used to measure the similarity of the data used for transfer learning

Table 7: Using multiple metrics to measure scNTImpute's transfer learning performance